
[WIP] feat: add persistent custom operator registry #968

Open
cmgzn wants to merge 15 commits into datajuicer:main from cmgzn:feat/persistent-custom-operators
Collaborator

@cmgzn cmgzn commented Apr 15, 2026

Summary

Add a persistent JSON-based registry (~/.data_juicer/custom_op.json) so that user-defined custom operators survive across processes without requiring re-registration on every run.

Motivation

Previously, custom operators had to be re-loaded via config every time a process started. This made it cumbersome to work with reusable custom ops across sessions, scripts, and CLI invocations.

Changes

  • data_juicer/utils/custom_op.py (new) — Core module for persistent custom op management:

    • JSON registry at ~/.data_juicer/custom_op.json storing source paths keyed by op name
    • load_persistent_custom_ops() replays registrations on startup, auto-cleaning stale entries
    • CLI interface: python -m data_juicer.utils.custom_op {list,register,unregister,reset}
    • Dynamic module/package loading extracted from config.py
  • data_juicer/utils/registry.py — Add unregister_module() to Registry class

  • data_juicer/ops/__init__.py — Call load_persistent_custom_ops() at import time after built-in ops are loaded

  • data_juicer/config/config.py — Replace inline loading logic with a re-export from custom_op for backward compatibility

  • data_juicer/tools/op_search.py — Harden OPRecord to handle custom ops with non-standard module paths, missing source files, and absent test files

  • data_juicer/tools/DJ_mcp_granular_ops.py — Adapt MCP tooling for enhanced OPRecord fields

  • docs/DeveloperGuide.md, docs/DeveloperGuide_ZH.md — Document the new persistent registration workflow
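The registry behavior described above (a JSON file of source paths keyed by op name, with stale entries cleaned on load) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the helper names `read_registry`, `write_registry`, and `prune_stale_entries` are hypothetical, and an entry is treated as stale simply when its source path no longer exists.

```python
import json
from pathlib import Path


def read_registry(registry_file: Path) -> dict:
    """Return the stored {op_name: source_path} mapping, or {} if missing or unreadable."""
    try:
        data = json.loads(registry_file.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
    return data if isinstance(data, dict) else {}


def write_registry(registry_file: Path, entries: dict) -> None:
    """Persist the mapping, creating the parent directory if needed."""
    registry_file.parent.mkdir(parents=True, exist_ok=True)
    registry_file.write_text(
        json.dumps(entries, indent=2, sort_keys=True), encoding="utf-8"
    )


def prune_stale_entries(registry_file: Path) -> dict:
    """Drop entries whose source path is gone and rewrite the file if anything changed."""
    entries = read_registry(registry_file)
    live = {name: src for name, src in entries.items() if Path(src).exists()}
    if live != entries:
        write_registry(registry_file, live)
    return live
```

A startup hook would then iterate over the returned mapping and load each remaining source path. How the real `load_persistent_custom_ops()` does this is not shown in this conversation.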

Testing

  • tests/utils/test_custom_op.py
  • tests/tools/test_op_search.py

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a persistent custom operator registry, enabling externally developed operators to be registered and automatically loaded across sessions. Key changes include the addition of data_juicer.utils.custom_op for registry management, an enhanced CLI for the operator search tool, and improved parameter handling in MCP tool generation. Review feedback identifies safety concerns regarding the manual cleanup of sys.modules during operator unregistration and highlights inconsistencies between the documentation and implementation regarding registry filenames and environment variables. Additionally, more robust error handling was recommended for resolving operator source paths.

Comment thread data_juicer/utils/custom_op.py Outdated
Comment thread data_juicer/utils/custom_op.py Outdated
Comment thread docs/DeveloperGuide.md
Comment thread docs/DeveloperGuide_ZH.md
Comment thread data_juicer/tools/op_search.py Outdated
except Exception as e:
    # Clean up partially-initialized module to avoid stale entries
    sys.modules.pop(module_name, None)
    raise RuntimeError(f"Error loading '{abs_path}' as '{module_name}': {e}")
Collaborator


Check whether OPERATORS also needs to be rolled back here.

Collaborator Author


Added _rollback_operators() helper. Both file and package loading branches now snapshot OPERATORS before loading and roll back on failure.
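The snapshot-and-rollback idea can be illustrated with a small stand-alone sketch. It is an assumption-laden toy, not the PR's `_rollback_operators()`: `ToyRegistry` is a hypothetical stand-in for OPERATORS, and `rollback_on_failure` is a hypothetical context manager wrapping the same idea.

```python
from contextlib import contextmanager


class ToyRegistry:
    """Hypothetical stand-in for the OPERATORS registry: a name -> class mapping."""

    def __init__(self):
        self.modules = {}

    def register(self, name, obj):
        self.modules[name] = obj


@contextmanager
def rollback_on_failure(registry):
    """Snapshot the registry before loading; restore it if loading raises."""
    snapshot = dict(registry.modules)  # shallow copy of the current registrations
    try:
        yield
    except Exception:
        # A failed load may have registered some ops before raising;
        # discard them so the registry matches its pre-load state.
        registry.modules.clear()
        registry.modules.update(snapshot)
        raise
```

Wrapping both the file branch and the package branch in the same guard keeps a half-imported module from leaving operators behind, which complements the `sys.modules.pop(...)` cleanup quoted above.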

Collaborator

@ShenQianli ShenQianli left a comment


The persistence model is at the operator level, but the real loading unit is the module/package path. That makes it unreliable for package-based custom ops, relative imports, and multi-operator modules. I think persistence should be path-based instead.

@cmgzn cmgzn force-pushed the feat/persistent-custom-operators branch from 21f6185 to 1e075cd Compare April 16, 2026 09:36
@cmgzn cmgzn changed the title feat: add persistent custom operator registry [WIP] feat: add persistent custom operator registry Apr 23, 2026
@cmgzn cmgzn force-pushed the feat/persistent-custom-operators branch from ca78d61 to 6828fca Compare May 7, 2026 09:00
cmgzn added 3 commits May 15, 2026 10:54
- Remove redundant _read_registry() call in stale entry cleanup (race condition)
- Restore single-quote style in __all__ to avoid unrelated formatting changes
- Add missing blank lines between test classes (PEP 8)
… custom operations loading and testing isolation.
Collaborator Author

cmgzn commented May 18, 2026

The persistence model is at the operator level, but the real loading unit is the module/package path. That makes it unreliable for package-based custom ops, relative imports, and multi-operator modules. I think persistence should be path-based instead.

The registry is now keyed by registration path (file/directory), not operator name. Operator names are always derived at runtime from the live OPERATORS registry after loading. This handles package-based custom ops, relative imports, and multi-operator modules correctly.
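One way to picture "names derived at runtime" is to diff the live registry before and after loading a path. The sketch below is only an illustration under assumptions: `OPERATORS` here is a plain dict standing in for the real registry, and `toy_load` is a hypothetical loader that executes a file with that dict injected via `runpy.run_path`, which is not how the PR imports modules.

```python
import runpy

OPERATORS = {}  # toy stand-in for the live operator registry


def toy_load(path: str) -> None:
    """Hypothetical loader: run the file with OPERATORS pre-injected into its globals."""
    runpy.run_path(path, init_globals={"OPERATORS": OPERATORS})


def ops_registered_by(path: str) -> list:
    """Load one registered path and report which operator names it added."""
    before = set(OPERATORS)
    toy_load(path)
    return sorted(set(OPERATORS) - before)
```

With this shape, the persisted file only needs to remember paths. A module that defines several operators is covered by a single entry, and no operator name has to be stored or kept in sync. A path that only re-registers existing names reports nothing new in this toy version.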
